feat: [WIP] Add support for `COUNT(DISTINCT expr)` #2273

andygrove · 2025-09-01T17:24:18Z

Which issue does this PR close?

Closes #2292

Rationale for this change

Increase coverage of TPC-H benchmark.

What changes are included in this PR?

Add support for COUNT(DISTINCT expr)

How are these changes tested?

codecov-commenter · 2025-09-01T17:47:21Z

Codecov Report

❌ Patch coverage is 42.30769% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.43%. Comparing base (f09f8af) to head (6e124b4).
⚠️ Report is 449 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...main/java/org/apache/comet/vector/CometVector.java | 0.00% | 15 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2273      +/-   ##
============================================
+ Coverage     56.12%   57.43%   +1.30%     
- Complexity      976     1287     +311     
============================================
  Files           119      146      +27     
  Lines         11743    13387    +1644     
  Branches       2251     2377     +126     
============================================
+ Hits           6591     7689    +1098     
- Misses         4012     4431     +419     
- Partials       1140     1267     +127


comphead · 2025-09-02T03:58:05Z

That looks incredibly useful!

andygrove · 2025-09-03T14:36:34Z

I have test failures with columnar shuffle trying to call the unimplemented method `getLong` on `CometListVector`. I suspect that this is related to writing shuffle output from a partial aggregate. I am putting this on hold for now and will perhaps try to get this into the 0.11.0 release.

andygrove · 2025-09-03T15:57:24Z

I understand the issue now. For `COUNT(DISTINCT bool_col)`, the partial count outputs all of the distinct values for `bool_col`, so we have a boolean list vector containing `[[true], [false]]`. The columnar shuffle writer does not understand this and thinks that the schema is a single `LongType` representing the final output from the count.

andygrove · 2025-09-03T16:09:21Z

I understand the issue now. For `COUNT(DISTINCT bool_col)`, the partial count outputs all of the distinct values for `bool_col`, so we have a boolean list vector containing `[[true], [false]]`. The columnar shuffle writer does not understand this and thinks that the schema is a single `LongType` representing the final output from the count.

If I fall back to Spark for the shuffle, I see the same issue. So the problem is perhaps that we report the wrong schema as the output from the partial aggregate.
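
To illustrate the mismatch described above, here is a minimal Python sketch of a two-stage distinct count. It is not Comet or Spark code: the function names, the `"long"` type tag, and the `check_shuffle_value` helper are all hypothetical and exist only to show why the partial stage has to ship a collection of distinct values rather than a number, and why a writer that trusts a declared `LongType` schema would reject that intermediate state.

```python
# Hypothetical sketch; names are illustrative and not taken from Comet or Spark.

def partial_count_distinct(rows):
    """Partial stage: collect the distinct non-null values in one partition."""
    return {v for v in rows if v is not None}


def final_count_distinct(partial_states):
    """Final stage: merge the per-partition sets, then count."""
    merged = set()
    for state in partial_states:
        merged |= state
    return len(merged)


def check_shuffle_value(declared_type, value):
    """Stand-in for a shuffle writer that trusts the declared schema."""
    if declared_type == "long" and not isinstance(value, int):
        raise TypeError(f"declared long, got {type(value).__name__}")
    return value


# Three partitions of a boolean column, including one null.
partitions = [[True, False, True], [False, None], [True]]
partials = [partial_count_distinct(p) for p in partitions]

# Merging the sets before counting gives the correct answer.
distinct_total = final_count_distinct(partials)

# Summing per-partition counts double-counts values seen in several partitions.
naive_total = sum(len(s) for s in partials)
```

Here `distinct_total` is 2 while `naive_total` is 4, which is why the intermediate state cannot be a single long. Calling `check_shuffle_value("long", partials[0])` raises a `TypeError` in this sketch, loosely analogous to the shuffle writer calling `getLong` on a list vector; only the final result passes the check.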

andygrove · 2025-09-03T17:16:06Z

I learned a lot from this WIP PR and have documented my findings in #2292 so I will close this for now.

andygrove added 4 commits September 1, 2025 10:54

Add support for COUNT(DISTINCT)

bba93bc

specialize

f5566f2

revert unrelated changes

8c927ac

improve fallback message

5e1be44

fix regression

d11342e

andygrove changed the title ~~feat: Add support for COUNT(DISTINCT expr)~~ feat: [WIP] Add support for COUNT(DISTINCT expr) Sep 3, 2025

andygrove added 3 commits September 3, 2025 09:13

improve exception message

f495046

improve exception message

c851cfa

upmerge

6e124b4

andygrove closed this Sep 3, 2025
